Abstract

Advances in IC process technology make future chip multiprocessors (CMPs) increasingly vulnerable to
transient faults. To detect transient faults, previous core-level schemes provide redundancy for each core
separately. As a result, transient faults in the uncore parts, which occupy over 50% of the area of a
modern CMP, may escape detection.

This paper proposes RepTFD, the first core-level transient fault detection scheme with 100% coverage.
Instead of providing redundancy for each core separately, RepTFD provides redundancy for a
group of cores as a whole. Specifically, it replays the execution of the checked group of cores on a
redundant group of cores. By comparing the execution results of the two groups of cores,
all malignant transient faults can be caught. Moreover, RepTFD adopts a novel pending period based
record-replay approach, which greatly reduces the number of execution orders that need to be enforced
in the replay-run. As a result, RepTFD incurs only 4.76% performance overhead in comparison to
normal execution without fault tolerance, according to our experiments on the RTL design of an industrial
CMP named Godson-3. In addition, RepTFD consumes only about 0.83% of the area of Godson-3 and
needs only trivial modifications to its existing components.



1 Introduction 



1.1 Motivation 



A transistor on a chip may suffer a transient fault for various reasons, including environmental
interference, power supply noise, high-energy particles, and so on. Given the ever increasing
number of transistors in commercial chip multiprocessors (CMPs), CMPs are more and more vulnerable to
transient faults. Although a transient fault happens only once and does not recur in the future
execution of a CMP, the error resulting from a transient fault may propagate to other parts of the CMP,
causing incorrect instruction results or even a system crash.

A traditional methodology to detect transient faults is to use a redundant core to assist each checked core
separately [9][18][29][14][25]. A redundant core has the same program and inputs as the corresponding
checked core, so the two cores should produce the same instruction results. Once an instruction on the
redundant core has a different result from the corresponding instruction on the checked core, a transient fault
is detected. In such a case, both cores are rolled back to a previous checkpoint using an existing
checkpointing mechanism [20][30] and re-execute the program to recover from the fault.

Figure 1: Coverage problems of previous schemes. (a) represents previous schemes, where a fault in the
uncore parts of the chip (last-level cache (LLC), network on chip (NoC), memory controller, and so on) may
escape detection. (b) represents our scheme, which can successfully detect all the faults.



Although previous core-level schemes (providing redundancy for each core separately) can effectively
detect transient faults in cores, when coping with parallel programs they may miss transient
faults in the uncore parts of the CMP, including the last-level cache (LLC), network on chip
(NoC), memory controller, and so on. An illustrative example is given in Figure 1(a), where core P0'
is the redundant core of core P0, and core P1' is the redundant core of core P1. Assume that core P1 has
correctly stored a value in its local L1 cache. When this value is evicted from core P1 to the LLC
or memory, it may be corrupted by a transient fault in the uncore parts, which makes all subsequent loads
of the value faulty. Once core P0 and core P0' load this faulty value from the uncore, they will produce
the same incorrect results. Obviously, such a transient fault in the uncore parts can be detected
by neither the result comparison between P0 and P0' nor the result comparison between P1 and P1'.
Considering that the uncore parts may consume 50-70% of the area of a state-of-the-art commercial CMP,
there is an urgent need for a scheme that can comprehensively detect transient faults on the whole chip.



1.2 Our Idea 

Previous core-level schemes (which provide redundancy for each core separately) fail to detect all transient
faults because the checked core and the corresponding redundant core can load data from the same source. As a
result, a transient fault that affects the common data source of both the checked core and the corresponding
redundant core will escape detection. Hence, to achieve 100% coverage for transient fault detection,
our idea is to provide redundancy for a group of cores as a whole. As shown in Figure 1(b), core P0 and
core P1 are treated as the checked group of cores, while core P0' and core P1' belong to the redundant group
of cores. There is no data dependency between the two groups: core P0 and core P1 do not
rely on any datum produced by the redundant group (core P0' and core P1'), while core P0' and core P1' do not
rely on any datum produced by the checked group (core P0 and core P1). When a transient fault occurs, it
cannot affect both groups of cores simultaneously. For example, if a transient fault happens in the data
transfer from core P1 to core P0, core P0' will not be affected by this fault. Hence, by comparing the
instruction results of core P0 and core P0', we are able to detect the fault.
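
The argument above can be illustrated with a toy software model. This is only a sketch under stated assumptions: a Python dict stands in for the uncore (LLC, NoC, memory), a single bit flip stands in for the transient fault, and all names (`consumer`, `run_group`) are hypothetical rather than part of RepTFD.

```python
# Toy model: a dict stands in for the uncore (LLC, NoC, memory).
# All names are hypothetical and only illustrate the coverage argument.

def consumer(memory):
    """Work done by P0 (or its redundant copy): load x and compute on it."""
    return memory["x"] + 1

def run_group(fault_in_uncore):
    """One group of cores with its own data: P1 stores x, then P0 loads it."""
    memory = {}
    memory["x"] = 41                 # P1 stores a correct value
    if fault_in_uncore:
        memory["x"] ^= 0x4           # a bit flips while the value is in the uncore
    return memory, consumer(memory)

# Per-core redundancy: P0 and P0' both load from the SAME corrupted source.
shared_memory, p0_result = run_group(fault_in_uncore=True)
p0_redundant_result = consumer(shared_memory)
per_core_detects = p0_result != p0_redundant_result

# Group redundancy: the redundant group re-executes on its own data,
# so a single transient fault corrupts only one of the two groups.
_, checked_result = run_group(fault_in_uncore=True)
_, redundant_result = run_group(fault_in_uncore=False)
group_detects = checked_result != redundant_result

print(per_core_detects, group_detects)  # False True
```

In this model the per-core comparison sees two identical wrong results, while the group-level comparison sees a mismatch.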

Figure 2: RepTFD overview.



Based on the above idea, in this paper we propose RepTFD, a transient fault detection scheme based on
hardware-assisted deterministic replay². As shown in Figure 2(a), RepTFD uses one half of the cores in the
CMP as the checked group of cores, and uses the other half of the cores as the redundant group of cores.
These two groups of cores execute the same parallel program without data dependency between them. In the
execution on the checked group of cores (first-run), two types of information are recorded. One is the
order information between the memory instructions in the first-run, which is recorded in a determinism-log file;
the other is the instruction results of the first-run, which are recorded in a result-log file. Based on the
determinism-log, the first-run can be deterministically replayed on the redundant group of cores. Meanwhile,
based on the result-log, the correctness of the first-run can be checked against the replay-run on the redundant
group of cores. Since any transient fault can only affect one group of cores, by comparing the results of
the corresponding instructions on the two groups, RepTFD can detect malignant transient faults with 100%
coverage (here we say that a transient fault is malignant if it leads to an incorrect instruction result).

1 Though one can use error correcting codes (ECC) to improve the reliability of the uncore parts, doing so remarkably
increases the area of the uncore parts (both cache and NoC). Furthermore, ECC cannot cope with many common misbehaviors
in the uncore parts (e.g., lost transfers, mistransfers, transfer misordering, incorrect prefetching, and so on).

2 Deterministic replay aims at guaranteeing that two executions of the same program exhibit the same behavior in the
presence of non-deterministic factors. The general workflow of deterministic replay consists of two stages. In the first
stage, an execution of the program, called the first-run, is carried out, during which the non-deterministic factors are
continuously recorded as logs. In the second stage, a follow-up execution of the program, called the replay-run,
reproduces the first-run under the guidance of the recorded logs.

Besides the coverage of transient faults, performance overhead is also a crucial criterion for a transient
fault detection scheme. As shown in Figure 2(b), the execution of a program with fault tolerance
is divided into many segments using an existing checkpoint mechanism. The replay-run of each segment is
executed after the first-run of this segment. Obviously, the overall performance overhead
of replay-based transient fault detection depends on the slower of the first-run and the replay-run. Since
hardware-assisted deterministic replay approaches can easily achieve negligible slowdown (say, < 1%) in
the first-run [7][32][10], the overall performance overhead of RepTFD is determined by the speed of the
replay-run. Unfortunately, most existing hardware-assisted deterministic replay approaches focus on
reducing the recorded log size in order to record long executions of parallel programs³, and thus often cause remarkable
slowdown in the replay-run (> 18% to our best knowledge 1171 1321 fT71 l4l). 

Hence, instead of directly incorporating an existing hardware-assisted deterministic replay approach
into RepTFD, we propose a new deterministic replay approach with negligible slowdown of the
replay-run, based on the following observation: the more execution orders⁴ are directly recorded, the slower
the replay-run is. For example, if an execution order u → v is recorded in the determinism-log, then in
the replay-run, before v is executed, we must detect and wait for the completion of u to guarantee that
the replay-run is identical to the first-run.

Hence, RepTFD achieves an efficient replay-run by filtering out of the determinism-log those execution orders
that can be inferred from pending period information. Concretely, in the first-run, RepTFD
records a relaxed start time and a relaxed end time for each instruction block with a global clock; the
period between these two time points is called the pending period of the instruction block. If two instruction
blocks have non-overlapping pending periods, they have a physical time order. Since execution orders
between memory instructions in these two blocks can be inferred from the recorded pending period information
(and the resultant physical time order), we need not record them directly in the determinism-log.
In fact, > 99% of execution orders are inferrable from physical time orders [7]. As a consequence, the replay-run
of RepTFD is seldom paused to enforce the directly recorded execution orders⁵, and thus has only 4.76%
slowdown (in comparison to normal execution without fault tolerance), according to our experiments
over the SPLASH2 benchmarks.

To sum up, this paper makes the following contributions: 

1. Fault coverage: For parallel workloads, we propose the first core-level transient fault detection scheme
with 100% coverage, i.e., any malignant transient fault that causes an error in the result of an instruction
can be detected.

2. Performance overhead: Due to the reduction of the execution orders to be enforced in the replay-run,
RepTFD has only 4.76% performance overhead in comparison to normal execution without fault
tolerance for a 16-core CMP (8 checked cores and 8 redundant cores). In addition, RepTFD has the
smallest replay slowdown among existing deterministic replay approaches.

3. Implementation costs: Previous schemes have to resolve the possible input incoherence between the
pair of checked core and redundant core, which incurs non-trivial implementation costs to the CMP.
RepTFD elegantly avoids the troublesome input incoherence problem through deterministic replay,
and keeps most existing parts of the CMP unmodified. Experiments also show that RepTFD only
consumes 0.83% of the area of the whole chip. Hence, RepTFD can be easily implemented on a commercial
CMP with negligible design, verification, and area costs.

3 Note that RepTFD only needs to maintain the determinism-log for the last segment of the first-run (about 1 second), so the
extremely tiny log size achieved by previous deterministic replay approaches has quite limited importance for RepTFD. In practice,
the size of RepTFD's determinism-log for an 8-thread application is less than 15 MB when the instructions per cycle (IPC) of each
core is 1, which is already satisfactory for transient fault detection.

4 Execution order is a type of logical time order, which exists between two successive conflicting memory instructions. Here we
say that two memory instructions conflict if they access the same memory location and at least one of them is a write.

5 Enforcing the recorded pending periods may also pause the replay-run. However, the pending periods recorded by RepTFD are
quite relaxed (i.e., twice the actual execution time of the instruction block), so it is rare to pause the replay-run for enforcing
the recorded pending periods.

The rest of this paper is organized as follows: Section [2] introduces some related investigations. Section 
[3] describes the principle and implementation of RepTFD. Section [4] presents some experimental results. 
Section [5] concludes the whole paper.

2 Related Work 

2.1 Core-level Transient Fault Detection 

Researchers have proposed various core-level transient fault detection schemes. These schemes share a
similar idea: using a redundant core to protect each checked core from transient faults separately. When the
checked core and the corresponding redundant core produce different instruction results, a transient fault
is detected.

Core-level schemes require a checked core and the corresponding redundant core to have the same inputs
[22][23][34]. For single-threaded applications, this requirement can be straightforwardly satisfied through
providing the same program and I/O inputs Gl|27]]. However, for multi-threaded applications, previous 
schemes encounter the so-called input incoherence problem: even given the same program and I/O inputs, the
checked core and the redundant core may still get different results for the same load instruction if a third-party
core stores a new value to the corresponding memory location in the time gap between the load instructions
performed by the checked core and the redundant core. To address this input incoherence problem, [9] and
[18] employ a load value queue (LVQ), through which every load value is forwarded from the checked core
to the redundant core. However, the LVQ implementation has a high design complexity, especially when the
load instructions are executed out of order.
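
The input incoherence problem described above can be sketched in a few lines. This is a hypothetical timeline, not a model of any cited design: the address `"a"`, the values, and the list used as a load value queue are all assumptions made for illustration.

```python
# Hypothetical timeline; the address, values, and names are illustrative only.
memory = {"a": 1}

checked_load = memory["a"]      # the checked core loads a
memory["a"] = 2                 # a third-party core stores a new value in the gap
redundant_load = memory["a"]    # the redundant core performs the same load later

# The two loads disagree although no transient fault has occurred.
incoherent = checked_load != redundant_load

# LVQ-style remedy: forward the checked core's load value to the redundant core.
load_value_queue = [checked_load]
redundant_load_via_lvq = load_value_queue.pop(0)
coherent_with_lvq = checked_load == redundant_load_via_lvq

print(incoherent, coherent_with_lvq)  # True True
```

Forwarding removes the false mismatch, but it also means both cores consume the same loaded value, which is why a corrupted value in the uncore goes unnoticed by the comparison.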

Instead of forwarding the load results of the checked core to the redundant core, some schemes [29][14][25]
allow both cores to access the memory independently. Reunion [29] treats input incoherence similarly
to transient faults: when an input incoherence is detected, the roll-back recovery mechanism is
carried out. As the probability of an input incoherence is much greater than that of a transient fault,
Reunion may bring remarkable performance overhead (up to 250% for 8 checked cores according to [14]) for
roll-back recovery from input incoherence. DCC [14] maintains a memory access window for each pair of
checked core and redundant core to monitor the memory locations that are accessed by both cores.
Conflicting store instructions performed by a third-party core to those locations are stalled until both cores have
completed their memory instructions. However, the performance overhead is still high (19.2% on average for
8 checked cores according to [25]). Recently, TDB [25] was proposed to guarantee input coherence via the
cache coherence protocol. Although it has better performance and scalability (with respect to the number
of cores) than DCC, its design complexity and risk are quite high, since it requires designing a new cache
coherence protocol.

Even though the previous core-level transient fault detection schemes bring remarkable modifications to the
existing parts of a CMP, they still leave the uncore parts, which consume more than 50% of the area of a
commercial CMP, exposed to transient faults. This is because these schemes protect each checked core
with a redundant core individually. Hence, they allow the checked core and the corresponding redundant
core to load values from the same source (say, LLC or memory). If the source itself contains incorrect values
due to some transient fault in the uncore parts (e.g., in the transfer from some core's L1 cache to the LLC), both
the checked core and the redundant core will produce the same incorrect instruction results. As a result,
no core-to-core comparison can find such a transient fault in the uncore parts. Furthermore, the previous
schemes have to deal with the possible input incoherence between the pair of checked core and redundant
core, which incurs remarkable design, implementation, and verification costs to the CMP.

Essentially different from the previous core-level schemes, RepTFD treats all checked cores as an entire
group, and uses a redundant group of cores, which has the same number of cores as the checked group, to
protect the checked group as a whole. Since there is no value dependency between groups (i.e., a checked
core and a redundant core never load from the same memory location), any transient fault can affect at most one
group of cores, and thus can be easily detected. To the best of our knowledge, RepTFD is the first core-level
transient fault detection scheme with 100% coverage. Furthermore, RepTFD elegantly avoids the troublesome input
incoherence problem through deterministic replay, and thus does not require the non-trivial modifications to the
existing parts of a CMP that the previous schemes do. Moreover, RepTFD incurs only 0.83% overhead in chip
area and 4.76% performance overhead in comparison to normal execution without fault tolerance. To
sum up, the effectiveness, elegance, and efficiency of RepTFD demonstrate its applicability in industry.

2.2 Circuit-level Transient Fault Protection 

Besides the core level, a chip can be protected from transient faults at the circuit level by inserting redundant
circuit-level cells (e.g., registers, cells, RAMs). For example, many chips for aerospace engineering
adopt Triple Modular Redundancy (TMR), which uses two additional cells to protect one cell from transient
faults. Theoretically, any single fault can be detected and corrected with TMR. Moreover, ECC and parity
checking, which protect memory elements with information redundancy, can provide transient
fault tolerance in cache and memory.

In general, circuit-level protection against transient faults may significantly increase the area of the chip and
reduce its speed. Hence, the usage of circuit-level protection is limited to cache and memory in most
commercial CMPs.

2.3 Protection of the Uncore Parts 

There have also been many approaches to make the uncore parts more reliable (especially the NoC). To
tolerate incorrect behaviors of the NoC, BulletProof [5] and Vicis [8] were proposed to detect and recover
from the transfer loss. And HI EH provide some resilient routing algorithms to re-transfer the networks 
packets when a fault occurs in the NoC. Recently, [2] presents a cache coherence protocol framework which 
ensures coherence and reliability in the transfers even in the presence of hardware faults of NoC. 

However, the above approaches need non-trivial modifications to the existing uncore parts (as well as
cores), which may bring additional design and re-verification overheads to the CMP. In comparison,
without any modification to the existing uncore parts, RepTFD can detect transient faults in the uncore parts
as well as the cores. Moreover, the fact that a number of approaches have been proposed to protect some
uncore parts also demonstrates the urgent need for a fault detection scheme that covers not only the
cores but also the uncore parts.

2.4 Deterministic Replay 

Deterministic replay aims at guaranteeing that two executions of the same program exhibit the same behavior in the
presence of non-deterministic factors. An ideal deterministic replay approach for effective transient fault
detection should bring negligible slowdowns to both the first-run and the replay-run. While most
hardware-assisted deterministic replay approaches bring trivial slowdown to the first-run, few of them
provide an efficient replay-run (partly because they focus on pursuing a small log size, which is unimportant for
fault detection) E[32j[T5j[T6j[3T]|4l. For example, Karma [4], which is a state-of-the-art hardware-assisted 
deterministic replay approach, still brings about 18-21% slowdown in the replay-run.

Figure 3: An example to illustrate pending period and physical time order. u and v are two memory instructions
executed on core0 and core1 respectively. u is before v in physical time order because the end time of
u's pending period, t_e(u), is earlier than the start time of v's pending period, t_s(v).



To achieve good performance in the replay-run, RepTFD employs pending period based deterministic
replay, which originated from our previous conference paper [7]. However, to enable an efficient replay-run,
RepTFD as introduced in this journal paper has the following advantages over our conference paper [7]. 1)
RepTFD fixes the execution time of each instruction block (in the record-run), while [7] fixes the number
of instructions in each instruction block. Since the execution time of each block is unbounded in [7], some
execution orders between blocks with overlapping pending periods cannot be recorded. As a result, [7] has
to enforce these orders by replaying the program sequentially, which causes over 700% slowdown
when replaying an 8-thread program. In contrast, RepTFD records all non-inferrable execution orders
between blocks with overlapping pending periods, and thus enables replaying the program in parallel, which is
an essential factor for an efficient replay-run. 2) Even if [7] could replay the program in parallel, its replay-run
would still have larger overheads to enforce the execution orders non-inferrable from physical time orders, since
the unbounded execution time of a block allowed by [7] brings more non-inferrable execution orders (to be
enforced in the replay-run) than RepTFD.

In summary, RepTFD significantly outperforms our previous conference paper [7] in replay speed,
which is crucial for fault detection. Moreover, to the best of our knowledge, the 4.76% replay slowdown of
RepTFD is also the smallest among those of all existing deterministic replay approaches.



3 Implementation of RepTFD 

3.1 Overview of RepTFD 

RepTFD uses one half of the cores in the CMP as the checked group of cores, and uses the other half of
the cores as the redundant group of cores. These two groups of cores execute the same parallel program
without data dependency between them. The first-run (on the checked group of cores) is divided into many
segments using an existing checkpoint mechanism, as shown in Figure 2(b). In each segment of the first-run,
two log files, the determinism-log and the result-log, are generated. The determinism-log is used to force
the replay-run (on the redundant group of cores) to behave the same as the first-run, and the result-log
contains the instruction results of the first-run.

After the first-run of the m-th segment, the replay-run of the m-th segment starts on the redundant
group of cores, and the results of the first-run and the replay-run are compared. Once there is a mismatch
between the results of the first-run and the replay-run, a malignant transient fault is detected. Then both the
checked group of cores and the redundant group of cores are rolled back to the most recent checkpoint to
recover from the transient fault. For example, if a fault is detected when replaying the m-th segment, both
the first-run and the replay-run should be rolled back to the (m-1)-th checkpoint to re-execute the m-th
segment (note that the replay-run must wait for the first-run to complete the re-execution of the m-th
segment first).
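
The segment-by-segment check and roll-back described above can be sketched as a simple loop. This is a deliberately sequential simplification under stated assumptions: `first_run` and `replay_run` are hypothetical callbacks that return a result summary for a segment, and `max_retries` is an assumed bound that the text does not specify.

```python
def run_with_detection(num_segments, first_run, replay_run, max_retries=3):
    """Execute segments in order; re-execute a segment when its replay disagrees.

    first_run(m, attempt) and replay_run(m, attempt) are hypothetical callbacks
    returning a result summary for segment m. Returns the list of roll-backs as
    (segment, checkpoint) pairs.
    """
    rollbacks = []
    for m in range(1, num_segments + 1):
        for attempt in range(max_retries + 1):
            checked = first_run(m, attempt)      # first-run on the checked group
            redundant = replay_run(m, attempt)   # replay-run follows the first-run
            if checked == redundant:
                break                            # segment m is clean; move on
            rollbacks.append((m, m - 1))         # roll both groups back to checkpoint m-1
        else:
            raise RuntimeError(f"segment {m} keeps failing")
    return rollbacks

# Inject one transient fault into the first attempt of segment 2's first-run.
def first_run(m, attempt):
    return m * 10 + (1 if (m == 2 and attempt == 0) else 0)

def replay_run(m, attempt):
    return m * 10

print(run_with_detection(3, first_run, replay_run))  # [(2, 1)]
```

Because the injected fault is transient, the re-execution of segment 2 matches its replay and execution proceeds to segment 3.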

3.2 First-run 

As mentioned above, two log files, i.e., the determinism-log and the result-log, are recorded in the first-run.

3.2.1 Determinism-log 

For deterministic replay, the execution orders among memory instructions should be recorded in the
determinism-log. To achieve determinism, each recorded execution order (e.g., u → v) should be enforced in
the replay-run with dedicated synchronization: before v is executed, the replay-run must detect and wait for
the completion of u. As a result, enforcing the recorded execution orders becomes the main performance overhead
of the replay-run J7J HH [T7j . To implement efficient replay-run, RepTFD employs a pending period based 
recording approach to reduce the number of directly recorded execution orders that need to be enforced in 
the replay-run. 
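
To make the cost of a directly recorded order concrete, the following sketch enforces a single order u -> v in software. It is an analogy only: Python threads stand in for cores and a `threading.Event` stands in for the dedicated hardware synchronization; the names `perform`, `completed`, and `trace` are hypothetical.

```python
import threading

completed = {"u": threading.Event()}   # one completion flag per recorded source
trace = []                             # order in which instructions are performed
trace_lock = threading.Lock()

def perform(name, wait_for=None):
    """Perform one instruction, stalling first if a recorded order requires it."""
    if wait_for is not None:
        completed[wait_for].wait()     # stall until the predecessor has completed
    with trace_lock:
        trace.append(name)
    if name in completed:
        completed[name].set()          # signal completion to any waiting core

# Recorded order u -> v; v's thread is started first to show that it still waits.
core1 = threading.Thread(target=perform, args=("v", "u"))
core0 = threading.Thread(target=perform, args=("u",))
core1.start()
core0.start()
core0.join()
core1.join()
print(trace)  # ['u', 'v']
```

Every recorded order introduces such a potential stall in the replay-run, which is why reducing the number of recorded orders matters for replay speed.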

Before introducing how RepTFD records execution orders, we need to introduce the concepts of pending
period and physical time order, which were proposed in our previous investigations [6][7], as preliminaries.
As shown in Figure 3, the pending period of an instruction is a relaxed time interval (denoted as [t_s, t_e]) on
the global clock in which the instruction is globally performed. For a CMP with a global clock, the pending
periods of any two instructions u and v can be compared. If their pending periods do not overlap, there exists
a physical time order between these two instructions: if the end time of u's pending period (also denoted as
u's end time for brevity) is earlier than the start time of v's pending period (also denoted as v's start time for
brevity), then we say that u is before v in physical time order.

In [6], Chen et al. proved that if an instruction u is before another instruction v in physical time order,
v cannot be before u in any kind of logical time order. The reason is quite intuitive: instruction u must
have been globally performed before v begins to execute, so v cannot affect the result of u.
Therefore, if the physical time order between a pair of memory instructions is known from the
pending period information, the logical time order (including execution order) between the two instructions can
be inferred. In fact, most (> 99%) execution orders can be inferred from the pending period information
according to [7].

Therefore, instead of directly recording all execution orders, RepTFD records the pending period
information and the non-inferrable execution orders (i.e., execution orders that cannot be inferred from the
pending period information). To reduce the overhead, RepTFD records the pending period of each instruction
block instead of each individual instruction⁶. Through sampling, each thread is naturally divided into instruction
blocks, where the i-th instruction block consists of the instructions whose global-perform times are between the
(i-1)-th and i-th samplings. Note that some instructions in the i-th block (whose global-perform times are
later than the (i-1)-th sampling) may begin earlier than the (i-1)-th sampling. Hence, the pending period
of the i-th block is between the (i-2)-th and i-th samplings. That is, the length of the pending period of each
block (two sampling spans) is twice the actual execution time of the block (one sampling span). As shown
in Figure 4, the instructions in block A are those instructions on core0 whose global-perform times are in the
region [t4, t5], while the pending period of block A is [t3, t5].

6 The pending period of an instruction block is the time interval between the earliest start time of any instruction in the
block and the latest end time of any instruction in the block.

Figure 4 provides an illustrative example of which execution orders should be recorded. Block A is
an instruction block executed on core0, and blocks B, C, D, E, F are five consecutive instruction blocks
executed on core1. The pending periods of blocks A, B, C, D, E, F are [t3, t5], [t1, t3], [t2, t4], [t3, t5],
[t4, t6], [t5, t7] respectively. With pending period information, the red dashed execution orders, i.e., the
execution orders between memory instructions in block A and memory instructions in block B, need not
be recorded. These execution orders are inferrable, because the memory instructions in block B have been
globally performed by sampling time t3, when the memory instructions in block A have not yet started. Similarly,
all the green dashed execution orders are also inferrable. Actually, only the execution orders between memory
instructions in block A and memory instructions in blocks C, D, and E (the bold execution orders) are
non-inferrable and thus should be recorded.
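
The Figure 4 example can be checked mechanically. The sketch below expresses pending periods as pairs of sampling indices and classifies each core1 block against block A. One assumption is made explicit: two periods that merely touch at a shared sampling time are treated as ordered, which is what the B-versus-A case in the text implies. The function names are hypothetical.

```python
def block_pending_period(i):
    """Pending period of the i-th block, in sampling indices: [t_(i-2), t_i]."""
    return (i - 2, i)

def inferrable(p, q):
    """True if the two pending periods do not overlap.

    Periods that only touch at a shared sampling time count as ordered here,
    matching blocks B and A in Figure 4 (an assumption of this sketch).
    """
    return p[1] <= q[0] or q[1] <= p[0]

# Pending periods from Figure 4, written as sampling indices.
block_a = (3, 5)                                     # block A on core0
core1_blocks = {name: block_pending_period(i)        # blocks B..F on core1
                for name, i in zip("BCDEF", range(3, 8))}

must_record = [name for name, period in core1_blocks.items()
               if not inferrable(block_a, period)]
print(must_record)  # ['C', 'D', 'E']
```

Only the blocks whose pending periods overlap that of block A remain, which agrees with the bold arrows described for Figure 4.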


Figure 4: Pending period of instruction blocks via sampling. The dashed arrows represent the execution 
orders inferrable from the pending period information, while the bold arrows represent the execution orders 
that cannot be inferred from the pending period information. 

Concretely, to record the pending period information in the determinism-log, RepTFD employs a PC-sampling
technique. At a sampling time T, instructions that have committed out of the instruction window
must have an end time before T. Meanwhile, instructions that have not entered the instruction window
must have a start time after T. Thus, we can obtain the pending period of each instruction by observing
the instructions in the instruction window at every sampling time.

To record the non-inferrable execution orders (i.e., execution orders that cannot be inferred from the
pending period information), RepTFD employs a CAM for each core to save the addresses accessed by the
memory instructions in the last block and the current block of the core. If a local L1 miss occurs when
executing instruction u, the address of u is searched in the CAMs of the other cores to check whether there is a
conflicting memory instruction. If u hits the accessed address of some instruction v saved in the CAMs,
the information of u and the information of v are both recorded as a non-inferrable execution order
v → u. Take Figure 4 for instance. Between t4 and t5, core0 is executing block A and core1
is executing block D. Memory operations in blocks C and D are stored in the CAM of core1. If
an L1 miss occurs on core0 when executing an instruction in block A, core0 searches the CAM of core1 to
see whether there is a conflicting memory instruction. In this manner, all non-inferrable execution orders
from instructions in block C or D to instructions in block A can be recorded. Similarly, all non-inferrable
execution orders from instructions in block A to instructions in block D or E can also be recorded when
core1 checks the CAM of core0. As a result, all non-inferrable execution orders, which lie between blocks
with overlapping pending periods, are recorded.
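
The CAM lookup described above can be modelled in software as follows. This is a minimal sketch, not the hardware design: the class `AddressCam`, the function `on_l1_miss`, and the tuple layout of the log are hypothetical, and a two-entry deque of sets stands in for the "last block plus current block" storage.

```python
from collections import deque

class AddressCam:
    """Software stand-in for one core's CAM (last block + current block)."""

    def __init__(self):
        self.blocks = deque([set(), set()], maxlen=2)   # [last, current]

    def new_block(self):
        self.blocks.append(set())        # the oldest block falls out of the CAM

    def access(self, instr, addr, is_write):
        self.blocks[-1].add((instr, addr, is_write))

    def conflicts(self, addr, is_write):
        """Saved instructions touching addr where at least one side is a write."""
        return [instr for block in self.blocks
                for (instr, a, w) in sorted(block)
                if a == addr and (w or is_write)]

def on_l1_miss(core, instr, addr, is_write, cams, log):
    """Search the other cores' CAMs and log each conflicting order v -> u."""
    for other, cam in cams.items():
        if other != core:
            for v in cam.conflicts(addr, is_write):
                log.append((v, instr))

cams = {0: AddressCam(), 1: AddressCam()}
log = []
cams[1].access("v", 0x40, is_write=True)      # core1 writes 0x40 in its current block
on_l1_miss(0, "u", 0x40, False, cams, log)    # core0 misses on a load of 0x40
cams[0].access("u", 0x40, is_write=False)
print(log)  # [('v', 'u')]
```

Two loads of the same address do not conflict, so a read-read pair would leave the log empty.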

Notably, the length of the sampling span is closely related to the replay speed. As shown in Figure 4, the
pending period of an instruction block (two sampling spans) is a relaxation of the actual execution time
of the block (one sampling span). Hence, the higher the sampling frequency, the more accurate the pending
period is, and the less relaxation there is for each block. As a result, with a shorter sampling span, the
replay-run needs to spend more effort enforcing physical time orders. On the other hand, a longer sampling span
yields fewer inferrable execution orders, which means the replay-run needs to spend more effort enforcing
non-inferrable execution orders. In practice, RepTFD sets the sampling span to 512 clock cycles
as a tradeoff between the effort to enforce physical time orders and the effort to enforce non-inferrable
execution orders in the replay-run.

3.2.2 Result-log 

The result-log needs to record the instruction results in the first-run. Without loss of generality, assume
that each instruction modifies a 32-bit register or a 32-bit memory location. However, recording 32
bits for each instruction may require hundreds of GBytes per second, which is impractical for state-of-the-art
I/O and memory interfaces. To reduce the size of the result-log, we utilize a technique called CheckSum,
which significantly reduces the amount of data through lossy compression.



Algorithm 1: CheckSum Technique

module CheckSum;
begin
    input [31:0] result;                /* result of an instruction */
    input new_commit;                   /* commit an instruction */
    input [63:0] commit_instructions;
    output reg [31:0] checksum;
    output out_valid;                   /* valid when outputs the checksum */

    always @(posedge clock)
        if (reset || out_valid) then
            out_valid <= 1'b0;
            checksum <= 32'b0;          /* initialization */
        else
            if (new_commit) then
                checksum <= checksum XOR result;
            if (commit_instructions % 1024 == 0) then
                out_valid <= 1'b1;
                export(checksum, result_log);



As shown in Algorithm 1, RepTFD employs a 32-bit checksum register and a few gates for each core
to summarize the instruction results on the core. The value of each checksum register is initialized to 0 at
reset. When an instruction commits, the checksum register is updated to the result
of the instruction XOR the old value of the checksum register. Every 1024 instructions, the out_valid
register is set to 1'b1. At that time, the current value of the checksum register is exported to the
result-log and then re-initialized. The CheckSum result-log is therefore quite small: it consumes only
4 bytes per kilo-instruction. Furthermore, besides CheckSum, other lossy compression approaches can also
be applied to reduce the size of the result-log.
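
As a rough software analogue of the CheckSum module, the following sketch folds each committed result into a 32-bit XOR register and exports the register every `window` commits. The class name CheckSumUnit and its method names are hypothetical, and the cycle-level behavior of out_valid is not modeled.

```python
from typing import List


class CheckSumUnit:
    """Behavioral model of a per-core XOR checksum register (illustrative only)."""

    def __init__(self, window: int = 1024) -> None:
        self.window = window            # instructions summarized per checksum
        self.checksum = 0               # the 32-bit checksum register
        self.committed = 0              # committed instruction count
        self.result_log: List[int] = []

    def commit(self, result: int) -> bool:
        """Fold one committed result; return True when a checksum is exported."""
        self.checksum ^= result & 0xFFFFFFFF
        self.committed += 1
        if self.committed % self.window == 0:
            self.result_log.append(self.checksum)
            self.checksum = 0           # re-initialize after export
            return True
        return False
```

With the default window of 1024, each exported 32-bit value corresponds to the 4 bytes per kilo-instruction quoted in the text.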

3.3 Replay-Run 

In the replay-run, the same program is replayed on the other half of the cores of the CMP (the redundant
group of cores) according to the determinism-log. In addition, the instruction results of the replay-run are
compared dynamically with the result-log recorded in the first-run.

3.3.1 How to Replay 

For faithful replay, two types of information recorded in the determinism-log must be enforced
in the replay-run. One is the pending period information (as well as the resultant physical time orders). The
other is the non-inferrable execution orders.

To enforce the physical time orders in the replay-run, an instruction block A cannot be executed until
all instruction blocks before A in physical time order have finished executing. Instead of analyzing the
pending period information to derive detailed physical time orders, we utilize a "grant" array in the replay-run
to enable efficient replay. Each element of the grant array represents the state of one sampling time. Concretely,
the array element of each sampling time s is initialized to 0. This value is increased by 1 when a
thread finishes all of its instruction blocks whose pending periods end no later than s. Any instruction block
whose pending period starts no earlier than s in the first-run cannot be executed in the replay-run until the
array element corresponding to s has been increased to p (the number of threads), which indicates
that all threads have granted the execution of instruction blocks starting no earlier than s. In this way, all
physical time orders are guaranteed in the replay-run at negligible cost.

In practice, three registers are assigned to each core to implement the above idea. The first
register, denoted by next_start, is the start time of the next block's pending period. The second register,
denoted by curr_end, is the end time of the current block's pending period. The third register, denoted
by next_end, is the end time of the next block's pending period.

Algorithm 2 is the detailed replay algorithm to enforce the physical time orders. grant(i) is the
aforementioned grant array, which represents how many cores have finished their instruction blocks whose
pending periods end no later than sampling time i. When an instruction block ends, the grant values of the
sampling times between curr_end and next_end are increased by 1, since the corresponding core has finished
all of its instruction blocks whose pending periods end no later than these sampling times. On the other hand,
when a new instruction block is about to start, the grant value of its start time, grant(next_start) in the
algorithm, is checked. The block can be executed only if the grant value equals the total number of cores p.
Otherwise, it must wait and stall its execution.
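
The grant-array rule can be sketched in software as follows. This is a simplified model, not the hardware of Algorithm 2. The names Block, GrantReplay, can_start, and finish_block are hypothetical. The sketch assumes that the pending-period end times on each core are non-decreasing, and that before its first block ends a core has trivially finished all blocks ending earlier, so it grants those sampling times at construction.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    start: int  # sampling time at which the recorded pending period starts
    end: int    # sampling time at which the recorded pending period ends


class GrantReplay:
    def __init__(self, schedules: List[List[Block]], horizon: int) -> None:
        self.schedules = schedules      # recorded blocks of each core, in program order
        self.p = len(schedules)         # number of cores
        self.horizon = horizon          # largest sampling time in the log
        self.grant = [0] * (horizon + 1)
        self.curr_end = [0] * self.p
        self.done = [0] * self.p        # blocks finished so far on each core
        for core in range(self.p):
            self._grant_up_to_next_end(core)

    def _grant_up_to_next_end(self, core: int) -> None:
        # Grant every sampling time in [curr_end, next_end) on behalf of this core.
        blocks = self.schedules[core]
        k = self.done[core]
        next_end = blocks[k].end if k < len(blocks) else self.horizon + 1
        for s in range(self.curr_end[core], next_end):
            self.grant[s] += 1
        self.curr_end[core] = next_end

    def can_start(self, core: int) -> bool:
        # The next block may start only if all p cores granted its start time.
        blocks = self.schedules[core]
        k = self.done[core]
        return k < len(blocks) and self.grant[blocks[k].start] == self.p

    def finish_block(self, core: int) -> None:
        if not self.can_start(core):
            raise RuntimeError("block was not allowed to start")
        self.done[core] += 1
        self._grant_up_to_next_end(core)
```

With one core holding a block whose pending period is [101, 103] and another core holding a block with pending period [99, 101], the first block is stalled until the second has finished, which matches the physical time order between two such blocks.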

Algorithm 2: Replay for Physical Time Orders

module replay_p;
begin
    reg [7:0] next_start, [7:0] curr_end, [7:0] next_end;
        /* the three mentioned registers */
    reg [7:0] next_start_new, [7:0] next_end_new;
        /* to update the value when a block ends or starts */

    if (end_a_block) then
        for (i = curr_end; i < next_end; i++) do
            grant(i)++;
        curr_end = next_end;
        next_end = next_end_new;

    if (start_a_block) then
        if (grant(next_start) == p) then    /* p is the number of cores */
            start_a_new_block;
            next_start = next_start_new;
        else
            pause_the_progress;

Besides the execution orders indirectly guaranteed by the physical time orders, the directly recorded
non-inferrable execution orders must also be enforced in the replay-run. Consider an execution order u → v
whose instructions u and v are executed on core0 and core1 respectively. In the replay-run, core1
cannot execute instruction v until u has been committed by core0. Hence, for each core, RepTFD inserts
a dedicated execution order buffer to import execution order information(*) and a single bit to pause the
progress of the core. With the information in the execution order buffer, core1, which executes instruction v,
knows that it must wait for the completion of u first. If u has not been completed, the pause bit of core1 is
set, and core1 pauses its own progress. When u has been completed by core0, core0 acknowledges
core1. As a result, core1's pause bit is cleared, and v can be executed by core1. In this way, all execution
orders directly recorded in the determinism-log can be guaranteed.

(*) According to our experiments, fewer than 1 non-inferrable execution order per 10,000 instructions is directly
recorded in the determinism-log on average. Thus, a tiny buffer (e.g., 32 bytes) is enough to buffer the recorded
execution orders.
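
The pause-bit handshake for a directly recorded order u → v can be modeled as below. This is a minimal sketch under assumptions: instructions are identified by a per-core count of committed memory instructions, the execution order buffer is a dictionary, and the names ReplayCore and commit are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ReplayCore:
    core_id: int
    committed: int = 0      # memory instructions committed so far in the replay-run
    paused: bool = False    # the pause bit
    # Execution order buffer: counter of v on this core -> (core of u, counter of u)
    order_buffer: Dict[int, Tuple[int, int]] = field(default_factory=dict)


def commit(core: ReplayCore, cores: List[ReplayCore]) -> bool:
    """Try to commit the next memory instruction; return False if the core pauses."""
    v = core.committed + 1
    dep = core.order_buffer.get(v)
    if dep is not None and cores[dep[0]].committed < dep[1]:
        core.paused = True              # u has not been committed yet
        return False
    core.paused = False
    core.committed = v
    # Acknowledge: clear the pause bit of any core waiting for this instruction.
    for other in cores:
        if other.paused and other.order_buffer.get(other.committed + 1) == (core.core_id, v):
            other.paused = False
    return True
```

In this model a paused core simply retries commit after its pause bit is cleared; the real design stalls the pipeline instead of polling.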

As mentioned, the major concern about the replay-run is the speed overhead, which is incurred when an
instruction is stalled to enforce some order (either a physical time order or a non-inferrable execution order)
with another instruction. To enforce the physical time orders in the replay-run, RepTFD only needs to guarantee
that each block is executed within its given pending period. Recall that the length of the pending period
for each instruction block is two sampling spans, while the actual execution time of each block is only one
sampling span. As a result, only a few blocks need to be stalled in the replay-run to enforce the recorded
pending periods and physical time orders. Meanwhile, the overhead for enforcing the directly recorded
non-inferrable execution orders is also low, since the number of non-inferrable execution orders is very small
(fewer than 1% of execution orders cannot be inferred from the physical time orders). To sum up, RepTFD does
not bring remarkable slowdown in the replay-run.

Take Figure 5 as an example to illustrate RepTFD's low performance overhead in the replay-run. Blocks A,
B, C, and D are four consecutive blocks executed on core0, and blocks E, F, G, and H are four consecutive blocks
executed on core1. The actual execution time of each block in the first-run is one sampling span (512 cycles).
The arrows represent the physical time orders between these blocks in the first-run. For example, block E is
before block A in physical time order, since the pending period of block A is from the 101st
sampling to the 103rd sampling, and the pending period of block E is from the 99th sampling to the 101st
sampling. Suppose each block on core0 has the same execution time in the replay-run as in the first-run. In the
replay-run, the executions of blocks A, B, and D are not stalled, since blocks E, F, and H are completed before
blocks A, B, and D start respectively. The only exception is block C, which is stalled since the end time of
block G is delayed by more than 512 clock cycles (longer than one sampling span). According to our experiments,
only 14.17% of instruction blocks on average are stalled to enforce the physical time orders, and
the average stalled time is only 21.75 clock cycles, which brings very low overhead to the performance of
the replay-run.

Figure 5: An example to illustrate the performance loss to enforce the physical time orders. Each block
takes one sampling span (512 cycles) to execute in the first-run, while its pending period is two
sampling spans (1024 cycles). In the replay-run, only block C is stalled to enforce the physical time
order G → C, since the end time of block G is delayed by more than 512 clock cycles.

Figure 6: Hardware Implementation for RepTFD. core0 - core3 are the checked group of cores, and core4
- core7 are the redundant group of cores.

3.3.2 How to Check Results 



The results in the replay-run are expected to be the same as those in the first-run. Therefore, by
comparing the results of the two runs, transient faults can be detected. RepTFD adopts dynamic comparison,
which compares the execution results by dynamically importing the values in the result-log.

Specifically, the results in the replay-run are lossily compressed with the same algorithm as in
the first-run (cf. Algorithm 1). Hence, each redundant core in the replay-run also generates a
checksum result for every 1024 instructions. Once a checksum result is generated in the replay-run, it
is compared with the corresponding checksum result generated in the first-run (which is saved in the
result-log). Such a comparison has only trivial hardware cost and does not affect the performance
of the replay-run.

It is worth noting that, due to lossy compression, there may be a rare case in which the first-run and the
replay-run have the same checksum result but different instruction results.
In such a rare case, a transient fault may not be detected immediately. This theoretical problem is shared by
many previous transient fault detection schemes [29][14][25]. However, a transient fault, even if not detected
immediately, will eventually be detected by comparing the results of future instructions affected by the fault.
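
To make the comparison and the rare aliasing case concrete, the sketch below computes windowed XOR checksums for two runs and reports the first window that differs. The helper names window_checksums and first_mismatch are hypothetical, and a small window is used in the example only to keep it short.

```python
from functools import reduce
from operator import xor
from typing import List, Optional, Sequence


def window_checksums(results: Sequence[int], window: int = 1024) -> List[int]:
    """XOR-fold each full window of instruction results into one 32-bit value."""
    full = len(results) - len(results) % window
    return [
        reduce(xor, results[i:i + window], 0) & 0xFFFFFFFF
        for i in range(0, full, window)
    ]


def first_mismatch(first_run: List[int], replay_run: List[int]) -> Optional[int]:
    """Index of the first differing checksum, or None if all compared values agree."""
    for i, (a, b) in enumerate(zip(first_run, replay_run)):
        if a != b:
            return i
    return None


good = list(range(8))

one_flip = list(good)
one_flip[2] ^= 0x10          # a single corrupted result changes the checksum

two_flips = list(good)
two_flips[1] ^= 0x10         # the same bit flipped in two results of one window
two_flips[2] ^= 0x10         # cancels out under XOR, so this window aliases
```

Here `first_mismatch(window_checksums(good, 4), window_checksums(one_flip, 4))` flags window 0, while the `two_flips` run passes the comparison; such a fault is caught only if it also changes results in a later window.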
3.4 Hardware Supports of RepTFD



In this subsection, we present the hardware modifications made to the RTL design of an industrial CMP
named Godson-3 to implement RepTFD. For the existing design of Godson-3 (whose architecture features
can be found in Table 1), RepTFD only trivially modifies each core as follows: 1) adding a 64-bit register
to each core to count the number of committed memory instructions, and 2) implementing a pause bit for each
core to pause its progress to guarantee orders in the replay-run. Most existing components of Godson-3
(including the L2 cache, memory controller, cache coherence protocol, and switch+mesh interconnection)
remain unmodified.



Table 1: Architecture Features of Godson-3.

Number of Processor Cores   2-16 cores
Frequency                   1.0-1.5 GHz
Network                     Crossbar+Mesh
Pipeline                    Four-issue, nine-stage, out-of-order
Functional Unit             Two 64-bit fix-point units, two 64-bit floating-point units, one 128-bit memory unit
Register File               32 logical registers and 64 physical registers for fix-point and floating-point respectively
L1 dcache                   Private, 4 way, 32KB, writeback, 32B per line, load-to-use 3(fix)/4(fp) cycles
L1 icache                   Private, 4 way, 64KB, 32B per line
L2 Cache                    Unified address, shared, 4-16 banks, 512K-1MB per bank, 4 way, 32B per line, 30-35 cycles latency
Memory                      1-4 DDR2/3 controllers, > 1GB per controller, 333MHz frequency, ~120 cycles latency
Cache Coherence             Directory-based MSI protocol



Figure 6 shows the detailed implementation of RepTFD. core0 - core3 are the checked group of cores,
and core4 - core7 are the redundant group of cores. As shown in the top half of Figure 6, the checked group of
cores needs the following hardware supports to generate the determinism-log and the result-log. (Note that
most of these hardware supports are decoupled from the existing components of Godson-3, and thus do not
require modifications to these existing components.)

1. A counter for each checked core, which works with PC-sampling to count the committed memory
instructions at every sampling time in order to record the pending period information. As PC-sampling is
already supported by most commercial CMPs, recording the pending period information incurs
very little cost.

2. A CAM for each checked core, which stores the memory operations in the last executed block and
the currently executing block. The size of the CAM is 1024 x 27 bits, because at most 1024
instructions (two sampling spans) need to be stored and each memory instruction needs 27 bits (including
the type, address, counter, and cache hit).

3. Some logic to record the non-inferrable execution orders: when an L1 miss occurs, the address of the
corresponding instruction is searched in the CAMs of the other cores. If it hits, the core number and instruction
counter of both this instruction and the hit instruction in the CAM are recorded.

4. A CheckSum module for each checked core to record the result-log. This module receives the
instruction results from the core and then processes the results using lossy compression. As shown in
Algorithm 1, this module consumes fewer than 100 register bits.

5. Some logic to export the logs out of the chip. 

As shown in the bottom half of Figure 6, the following hardware supports are needed by the redundant
group of cores to replay the determinism-log and check the result-log.

1. A pause bit for each core to pause its progress to guarantee orders in the replay-run.

2. A replay module for each redundant core to enforce both the physical time orders and the
non-inferrable execution orders. When an instruction must wait to enforce some order, the replay module
sends a signal to pause the progress of the corresponding core. To enforce the physical time orders,
several hundred register bits are consumed for each core to implement Algorithm 2. Meanwhile,
only a 16-entry buffer is required to enforce the non-inferrable execution orders.

3. A global grant array for the redundant group of cores to enforce the physical time orders. This array
cooperates with the replay module of each redundant core according to Algorithm 2.

4. A CheckSum module for each redundant core to check the instruction results of the first-run. 

5. Some logic to import the recorded logs into the chip. 



To sum up, the main design overhead of RepTFD is a 1024 x 27 CAM per checked core (it is also
feasible to replace the CAM with a space-efficient Bloom filter), which adds only a small cost in chip area.
Moreover, RepTFD has very low design complexity because most components of the existing CMP design
remain unmodified.



4 Experimental Results 

We implement RepTFD on the RTL design of the 8-core Godson-3, and carry out experiments over SPLASH2
benchmarks to validate the efficiency and effectiveness of RepTFD. Table 1 lists the detailed features of
Godson-3. In the experiments, when we mention "the performance of RepTFD", we refer to the
performance of redundantly executing an 8-thread application on a 16-core CMP (the first-run and the
replay-run of the application are simultaneously executed on 8 checked cores and 8 redundant cores respectively).
When we mention "the baseline performance", we refer to the performance of executing an 8-thread
application alone on a 16-core CMP. Figure 7 presents the performance of RepTFD normalized to the baseline
performance with a 512-cycle sampling span. The average performance overhead of RepTFD is 4.76% over the
benchmarks (see footnote 8). For benchmark water, the performance overhead is as low as 0.16%.

In general, the low performance overhead of RepTFD comes from two factors: the low cost to enforce
non-inferrable execution orders, and the low cost to enforce physical time orders. RepTFD pays negligible
cost to enforce non-inferrable execution orders, since there are very few non-inferrable execution orders that
need to be enforced in the replay-run. Figure 8 shows the number of non-inferrable execution orders per
10000 instructions (per thread). On average, only 1.47 execution orders need to be enforced for every 10000
instructions. For benchmark water, only 0.014 non-inferrable execution orders per 10000 instructions need to
be enforced in the replay-run. Thus, negligible cost is spent on enforcing the recorded execution orders.

On the other hand, RepTFD can cost-effectively enforce physical time orders, since the recorded pending
period for each instruction block is quite relaxed (twice the actual execution time of the block). As
illustrated in Figure 9 and Figure 10, in the replay-run, only 14.17% of instruction blocks are stalled to enforce
the physical time orders, while for each stalled block, the average stalled time is only 21.75 clock cycles.
In other words, across all 512-cycle instruction blocks, we only need to stall 3.08 cycles (14.17% x 21.75) per
block on average, which brings about 0.6% (3.08/512) performance penalty in the replay-run. For benchmark
water, the performance overhead is even lower: only 5.79% of instruction blocks are stalled and the average
stalled time for each stalled block is only 2.61 clock cycles.
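
The stall penalty quoted above is a simple expected-value calculation, reproduced here for both the average case and benchmark water. The function name stall_penalty is hypothetical; the inputs are the figures reported in the text.

```python
from typing import Tuple


def stall_penalty(stalled_fraction: float, avg_stall_cycles: float,
                  block_cycles: int = 512) -> Tuple[float, float]:
    """Return (expected stall cycles per block, fraction of a block's execution time)."""
    cycles_per_block = stalled_fraction * avg_stall_cycles
    return cycles_per_block, cycles_per_block / block_cycles


avg_cycles, avg_penalty = stall_penalty(0.1417, 21.75)      # average over benchmarks
water_cycles, water_penalty = stall_penalty(0.0579, 2.61)   # benchmark water
print(f"average: {avg_cycles:.2f} cycles/block, {avg_penalty:.2%}")
print(f"water:   {water_cycles:.2f} cycles/block, {water_penalty:.2%}")
```

Under the same formula, water stalls about 0.15 cycles per 512-cycle block, roughly 0.03% of the block's execution time.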



8 As references, two state-of-the-art transient fault detection schemes, Reunion [29] and DCC [14], have 5%-250%
and 19.2% performance overheads respectively for 16-core systems (8 checked cores + 8 redundant cores).
Figure 7: Normalized speed overhead for 8 threads in the replay-run.

Figure 8: Number of non-inferrable execution orders (per 10k instructions in one thread).

In addition, we also evaluate the size of the determinism-log. RepTFD must record the exact number
of instructions in each block, and thus has a larger log size than our previous conference paper LReplay, which
has a fixed number of instructions in each block. As shown in Figure 11, the average size of RepTFD's
determinism-log is 1.87 bytes per kilo-instruction, while LReplay's log size is only 0.36 bytes per
kilo-instruction (with a 512-cycle sampling span). However, RepTFD's log size is still acceptable for an 8-thread
fault-tolerant CMP: only 14.96 MB is required to record/replay a 1-second segment of an 8-thread program
(when the IPC is 1).
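
The 14.96 MB figure can be reproduced with the short calculation below. The 1 GHz clock and the convention 1 MB = 10^6 bytes are assumptions: the text does not state them, but they are consistent with the lower end of the frequency range in Table 1 and yield exactly the reported value. The function name is hypothetical.

```python
def log_bytes_per_second(bytes_per_kilo_inst: float, threads: int,
                         clock_hz: float, ipc: float = 1.0) -> float:
    """Log bandwidth for `threads` cores, each committing `ipc` instructions per cycle."""
    instructions_per_second = threads * clock_hz * ipc
    return bytes_per_kilo_inst * instructions_per_second / 1000.0


# Assumption: 1 GHz clock, IPC of 1, 8 threads, 1.87 bytes per kilo-instruction.
det_log = log_bytes_per_second(1.87, threads=8, clock_hz=1e9)
print(f"{det_log / 1e6:.2f} MB per second of execution")
```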

Figure 11: Log size in comparison to LReplay.



Moreover, we evaluate the area consumption of RepTFD. The main area overhead of RepTFD is a
1024 x 27 CAM per checked core; together these CAMs consume about 5.0 square millimeters, according to
the report of Synopsys's Design Compiler with the STMicro 65nm GP/LP mixed process. Since the overall
area of our 16-core CMP (under the same process) is about 600 square millimeters, the area consumption of
RepTFD is only about 0.83% of the whole chip.



5 Conclusion 

In this paper, we propose RepTFD, a core-level transient fault detection scheme which employs deterministic
replay to achieve 100% detection coverage. Unlike existing core-level schemes, which can only
protect each core separately, RepTFD protects both the cores and the uncore parts without modifications to
the uncore parts of the chip. To avoid remarkable performance overhead in the replay-run, RepTFD records
a relaxed pending period for each instruction block (whose length is twice the actual execution time
of the block) in the first-run. By cost-effectively enforcing the resultant physical time orders between
blocks in the replay-run, more than 99% of execution orders, which can be inferred from physical time orders,
do not need to be enforced explicitly. As a result, RepTFD incurs only 4.76% slowdown (in comparison to
normal execution without fault tolerance).

RepTFD is the first attempt to tackle transient fault detection with cutting-edge deterministic replay
techniques. We find that, though there are many previous investigations on detecting transient faults on
CMPs, few of them can do this job as efficiently, effectively, and elegantly as deterministic replay. On the
other hand, the requirements arising in transient fault detection call for a hardware-assisted deterministic
replay approach different from existing log-size-oriented replay approaches. This can motivate further
improvements in deterministic replay.